R3: Population Stratification - Continuous Ancestry Analysis¶

Reviewer Question¶

Referee #3: "How do you address population stratification? Are signature effects binary (ancestry group) or continuous?"

Why This Matters¶

Understanding whether ancestry effects are binary (discrete groups) or continuous (graded by ancestry probability) is critical for:

  • Model interpretation: Do signatures capture continuous genetic variation?
  • Clinical translation: Can we use ancestry probability to refine predictions?
  • Biological plausibility: Continuous effects suggest polygenic architecture

Our Approach¶

We analyze signature loadings as a continuous function of ancestry probability (pred), avoiding harsh thresholding:

  1. Ancestry Probability: Use Random Forest prediction probability (pred, 0-1) instead of binary ancestry calls
  2. Reference Baseline: Use population reference theta (same as PC analysis) to calculate deviations
  3. Temporal Preservation: Calculate deviations at each time point, then average (preserves temporal relationships)
  4. Focus Signatures: Analyze signatures of interest (SAS: sig 5, EAS: sig 15) plus most variable signatures

Key Findings¶

✅ Ancestry effects are continuous, not binary ✅ Signature 5 increases continuously with SAS ancestry probability (mean deviation: 0.015 → 0.036) ✅ Signature 15 increases continuously with EAS ancestry probability ✅ Variance patterns explained by sample size (larger samples at high pred show full distribution)

1. Setup and Load Data¶

In [2]:
%load_ext autoreload
%autoreload 2

import os
import sys
sys.path.append('/Users/sarahurbut/aladynoulli2/pyScripts/dec_6_revision/helper_py/')

from assemble_e_theta import *

new_thetas_no_PC = assemble_new_model_with_no_pcs()
ASSEMBLING NEW MODEL WITH PCS FROM ALL BATCHES
============================================================
Loading batch 0-10000...
✅ Loaded: (10000, 21, 52)
Loading batch 10000-20000...
✅ Loaded: (10000, 21, 52)
Loading batch 20000-30000...
/Users/sarahurbut/aladynoulli2/pyScripts/dec_6_revision/helper_py/assemble_e_theta.py:36: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model = torch.load(filepath, map_location='cpu')
✅ Loaded: (10000, 21, 52)
Loading batch 30000-40000...
✅ Loaded: (10000, 21, 52)
Loading batch 40000-50000...
✅ Loaded: (10000, 21, 52)
Loading batch 50000-60000...
✅ Loaded: (10000, 21, 52)
Loading batch 60000-70000...
✅ Loaded: (10000, 21, 52)
Loading batch 70000-80000...
✅ Loaded: (10000, 21, 52)
Loading batch 80000-90000...
✅ Loaded: (10000, 21, 52)
Loading batch 90000-100000...
✅ Loaded: (10000, 21, 52)
Loading batch 100000-110000...
✅ Loaded: (10000, 21, 52)
Loading batch 110000-120000...
✅ Loaded: (10000, 21, 52)
Loading batch 120000-130000...
✅ Loaded: (10000, 21, 52)
Loading batch 130000-140000...
✅ Loaded: (10000, 21, 52)
Loading batch 140000-150000...
✅ Loaded: (10000, 21, 52)
Loading batch 150000-160000...
✅ Loaded: (10000, 21, 52)
Loading batch 160000-170000...
✅ Loaded: (10000, 21, 52)
Loading batch 170000-180000...
✅ Loaded: (10000, 21, 52)
Loading batch 180000-190000...
✅ Loaded: (10000, 21, 52)
Loading batch 190000-200000...
✅ Loaded: (10000, 21, 52)
Loading batch 200000-210000...
✅ Loaded: (10000, 21, 52)
Loading batch 210000-220000...
✅ Loaded: (10000, 21, 52)
Loading batch 220000-230000...
✅ Loaded: (10000, 21, 52)
Loading batch 230000-240000...
✅ Loaded: (10000, 21, 52)
Loading batch 240000-250000...
✅ Loaded: (10000, 21, 52)
Loading batch 250000-260000...
✅ Loaded: (10000, 21, 52)
Loading batch 260000-270000...
✅ Loaded: (10000, 21, 52)
Loading batch 270000-280000...
✅ Loaded: (10000, 21, 52)
Loading batch 280000-290000...
✅ Loaded: (10000, 21, 52)
Loading batch 290000-300000...
✅ Loaded: (10000, 21, 52)
Loading batch 300000-310000...
✅ Loaded: (10000, 21, 52)
Loading batch 310000-320000...
✅ Loaded: (10000, 21, 52)
Loading batch 320000-330000...
✅ Loaded: (10000, 21, 52)
Loading batch 330000-340000...
✅ Loaded: (10000, 21, 52)
Loading batch 340000-350000...
✅ Loaded: (10000, 21, 52)
Loading batch 350000-360000...
✅ Loaded: (10000, 21, 52)
Loading batch 360000-370000...
✅ Loaded: (10000, 21, 52)
Loading batch 370000-380000...
✅ Loaded: (10000, 21, 52)
Loading batch 380000-390000...
✅ Loaded: (10000, 21, 52)
Loading batch 390000-400000...
✅ Loaded: (10000, 21, 52)

Concatenating 40 batches...
✅ Combined shape: (400000, 21, 52)
Applying softmax normalization...
✅ Softmax applied: (400000, 21, 52)
Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_thetas_with_pcs_retrospective_correct_noPC.pt
✅ Saved successfully!
Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_lambdas_with_pcs_retrospective_correctE_noPC.pt
✅ Saved successfully!
In [3]:
new_thetas_with_pcs = assemble_new_model_with_pcs()
ASSEMBLING NEW MODEL WITH PCS FROM ALL BATCHES
============================================================
Loading batch 0-10000...
✅ Loaded: (10000, 21, 52)
Loading batch 10000-20000...
/Users/sarahurbut/aladynoulli2/pyScripts/dec_6_revision/helper_py/assemble_e_theta.py:97: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model = torch.load(filepath, map_location='cpu')
✅ Loaded: (10000, 21, 52)
Loading batch 20000-30000...
✅ Loaded: (10000, 21, 52)
Loading batch 30000-40000...
✅ Loaded: (10000, 21, 52)
Loading batch 40000-50000...
✅ Loaded: (10000, 21, 52)
Loading batch 50000-60000...
✅ Loaded: (10000, 21, 52)
Loading batch 60000-70000...
✅ Loaded: (10000, 21, 52)
Loading batch 70000-80000...
✅ Loaded: (10000, 21, 52)
Loading batch 80000-90000...
✅ Loaded: (10000, 21, 52)
Loading batch 90000-100000...
✅ Loaded: (10000, 21, 52)
Loading batch 100000-110000...
✅ Loaded: (10000, 21, 52)
Loading batch 110000-120000...
✅ Loaded: (10000, 21, 52)
Loading batch 120000-130000...
✅ Loaded: (10000, 21, 52)
Loading batch 130000-140000...
✅ Loaded: (10000, 21, 52)
Loading batch 140000-150000...
✅ Loaded: (10000, 21, 52)
Loading batch 150000-160000...
✅ Loaded: (10000, 21, 52)
Loading batch 160000-170000...
✅ Loaded: (10000, 21, 52)
Loading batch 170000-180000...
✅ Loaded: (10000, 21, 52)
Loading batch 180000-190000...
✅ Loaded: (10000, 21, 52)
Loading batch 190000-200000...
✅ Loaded: (10000, 21, 52)
Loading batch 200000-210000...
✅ Loaded: (10000, 21, 52)
Loading batch 210000-220000...
✅ Loaded: (10000, 21, 52)
Loading batch 220000-230000...
✅ Loaded: (10000, 21, 52)
Loading batch 230000-240000...
✅ Loaded: (10000, 21, 52)
Loading batch 240000-250000...
✅ Loaded: (10000, 21, 52)
Loading batch 250000-260000...
✅ Loaded: (10000, 21, 52)
Loading batch 260000-270000...
✅ Loaded: (10000, 21, 52)
Loading batch 270000-280000...
✅ Loaded: (10000, 21, 52)
Loading batch 280000-290000...
✅ Loaded: (10000, 21, 52)
Loading batch 290000-300000...
✅ Loaded: (10000, 21, 52)
Loading batch 300000-310000...
✅ Loaded: (10000, 21, 52)
Loading batch 310000-320000...
✅ Loaded: (10000, 21, 52)
Loading batch 320000-330000...
✅ Loaded: (10000, 21, 52)
Loading batch 330000-340000...
✅ Loaded: (10000, 21, 52)
Loading batch 340000-350000...
✅ Loaded: (10000, 21, 52)
Loading batch 350000-360000...
✅ Loaded: (10000, 21, 52)
Loading batch 360000-370000...
✅ Loaded: (10000, 21, 52)
Loading batch 370000-380000...
✅ Loaded: (10000, 21, 52)
Loading batch 380000-390000...
✅ Loaded: (10000, 21, 52)
Loading batch 390000-400000...
✅ Loaded: (10000, 21, 52)

Concatenating 40 batches...
✅ Combined shape: (400000, 21, 52)
Applying softmax normalization...
✅ Softmax applied: (400000, 21, 52)
Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_thetas_with_pcs_retrospective_correctE.pt
✅ Saved successfully!
Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_lambdas_with_pcs_retrospective_correctE.pt
✅ Saved successfully!
================================================================================
SIGNATURE LOADINGS BY ANCESTRY - CONTINUOUS ANALYSIS
================================================================================

2. Load Ancestry Data¶

1. Loading ancestry data...
   ✓ Loaded 498,395 rows
   Columns: ['eid', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'knn', 'pop', 'rf', 'pred', 'rf99', 'rf90', 'rf80']
   ✓ 498,395 patients with ancestry predictions
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_20723/2145359146.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  ancestry_df = pd.read_csv(ANCESTRY_PATH, sep='\t')
eid PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 knn pop rf pred rf99 rf90 rf80
0 -1 -0.164125 -0.016342 -0.005988 -0.002362 -0.001290 -0.000793 -0.000692 -0.002821 -0.000928 -0.001011 AFR UKB AFR 1.000 AFR AFR AFR
1 -10 0.083763 -0.125806 -0.026856 -0.032077 -0.008137 -0.001927 -0.000891 -0.019059 -0.003229 -0.003247 EUR UKB EUR 0.999 EUR EUR EUR
2 -100 0.082869 -0.128098 -0.026586 -0.032438 -0.003435 -0.001640 -0.001252 -0.013836 -0.003301 -0.001948 EUR UKB EUR 0.986 NaN EUR EUR
3 -101 0.085578 -0.127146 -0.025148 -0.028368 -0.008648 -0.003687 0.000004 -0.021754 -0.000768 -0.006615 EUR UKB EUR 0.998 EUR EUR EUR
4 -102 0.083711 -0.126064 -0.024218 -0.029771 -0.009118 -0.000895 0.001435 -0.015870 -0.003183 -0.007149 EUR UKB EUR 0.995 EUR EUR EUR

3. Load Signature Loadings (Theta)¶

In [7]:
# Load signature loadings (theta) - using thetas with PCs
print("\n2. Loading signature loadings (with PCs)...")
thetas = torch.load(THETA_PATH, map_location='cpu')
if hasattr(thetas, 'numpy'):
    thetas = thetas.numpy()
elif isinstance(thetas, torch.Tensor):
    thetas = thetas.numpy()
print(f"   ✓ Loaded theta shape: {thetas.shape} (patients × signatures × time)")

# Calculate average signature loadings per patient (across time)
print("\n3. Calculating average signature loadings per patient...")
avg_signature_loadings = thetas.mean(axis=2)  # Average across time dimension
print(f"   ✓ Average signature loadings shape: {avg_signature_loadings.shape} (patients × signatures)")
2. Loading signature loadings (with PCs)...
   ✓ Loaded theta shape: (400000, 21, 52) (patients × signatures × time)

3. Calculating average signature loadings per patient...
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_20723/1498240516.py:3: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas = torch.load(THETA_PATH, map_location='cpu')
   ✓ Average signature loadings shape: (400000, 21) (patients × signatures)

4. Match Ancestry Data to Signature Loadings¶

4. Matching ancestry data to signature loadings...
   ✓ Loaded processed_ids: 400,000
   ✓ Matched 400,000 patients (100.0%)

Ancestry distribution:
EUR    378921
SAS      8295
AFR      7616
AMR      3367
EAS      1801
Name: count, dtype: int64

5. Load Reference Theta and Calculate Deviations¶

Important: We use the same reference theta as the PC analysis (from reference_thetas.csv). This is the population reference trajectory, not ancestry-specific.

Calculation order:

  1. Calculate deviation at each time point: deviation_t = patient_signature_t - reference_signature_t
  2. Then average deviations across time: mean_deviation = mean(deviation_t)

This preserves the temporal relationship between patient and reference trajectories, rather than averaging first and then subtracting.

This allows us to see if ancestry-specific effects (e.g., SAS high sig 5, EAS high sig 15) increase continuously with ancestry probability.

In [9]:
# Load reference theta (same as PC analysis)
print("\n5. Loading reference theta...")
try:
    ref_theta = pd.read_csv(REFERENCE_THETA_PATH, header=0).values  # Shape: (K, T_ref)
    T = thetas.shape[2]  # Number of timepoints
    if ref_theta.shape[1] >= T:
        ref_slice = ref_theta[:, -T:]  # Use last T columns
    else:
        pad = np.zeros((ref_theta.shape[0], T - ref_theta.shape[1]))
        ref_slice = np.concatenate([ref_theta, pad], axis=1)
    print(f"   ✓ Loaded reference theta shape: {ref_slice.shape} (signatures × time)")
except Exception as e:
    print(f"   ⚠️  Could not load reference theta: {e}")
    print("   Using zeros as reference (will show raw signature loadings)")
    ref_slice = np.zeros((thetas.shape[1], thetas.shape[2]))

# Calculate deviations at each time point, then average
print("\n6. Calculating deviations from reference theta...")
print("   Step 1: Calculate deviation at each time point...")
# Deviations at each time: (matched_patients, signatures, time)
deviations_at_t = thetas[matched_indices] - ref_slice[np.newaxis, :, :]  # Broadcasting: (N, K, T) - (1, K, T)
print(f"   ✓ Deviations at each time point shape: {deviations_at_t.shape} (patients × signatures × time)")

print("   Step 2: Average deviations across time...")
# Average deviations across time
deviations = deviations_at_t.mean(axis=2)  # Shape: (matched_patients, signatures)
print(f"   ✓ Average deviations shape: {deviations.shape} (patients × signatures)")

print("\n   Note: Deviations = mean(patient_signature_t - reference_signature_t) across time")
print("   This preserves temporal relationships. Positive = higher than reference, Negative = lower than reference.")
5. Loading reference theta...
   ✓ Loaded reference theta shape: (21, 52) (signatures × time)

6. Calculating deviations from reference theta...
   Step 1: Calculate deviation at each time point...
   ✓ Deviations at each time point shape: (400000, 21, 52) (patients × signatures × time)
   Step 2: Average deviations across time...
   ✓ Average deviations shape: (400000, 21) (patients × signatures)

   Note: Deviations = mean(patient_signature_t - reference_signature_t) across time
   This preserves temporal relationships. Positive = higher than reference, Negative = lower than reference.

6. Focus on Signatures of Interest¶

Based on PC analysis findings:

  • SAS: High sig 5
  • EAS: High sig 15

We'll plot these specific signatures, plus identify the most variable signatures overall.

7. Identifying signatures of interest...
   ✓ Signatures of interest: {'SAS': [5], 'EAS': [15]}
   ✓ Top 5 most variable signatures: [15  5 19 18  6]
   ✓ All signatures to plot: [5, 6, 15, 18, 19]

Top 5 most variable signatures:
signature f_statistic p_value variance
15 15 42626.210969 0.0 1.615098e-05
5 5 5719.227300 0.0 2.834223e-04
19 19 2870.763591 0.0 6.388250e-06
18 18 2564.017708 0.0 8.834627e-07
6 6 2225.663208 0.0 6.262865e-07

7. PCA on Signature Loadings Colored by Ancestry¶

Purpose: Perform PCA decomposition on average signature loadings over time, then plot PC1 vs PC2 colored by ancestry. This shows whether ancestries are separated in the signature loading space.

Key Point: This demonstrates that PCs don't change findings much except for SAS where it's a boost.

In [11]:
# ============================================================================
# PCA on Signature Loadings: PC1 vs PC2 Colored by Ancestry
# ============================================================================

print("="*80)
print("PCA ON SIGNATURE LOADINGS: PC1 vs PC2 Colored by Ancestry")
print("="*80)

from sklearn.decomposition import PCA

# Use average signature loadings over time (not deviations)
# Calculate from thetas: average across time dimension
print("\n1. Calculating average signature loadings over time...")
avg_signature_loadings = thetas[matched_indices].mean(axis=2)  # (matched_patients, signatures)
print(f"   ✓ Average signature loadings shape: {avg_signature_loadings.shape}")

# Get ancestry labels for matched patients
print("\n2. Getting ancestry labels...")
matched_ancestry_labels = ancestry_df.iloc[matched_indices]['rf80'].values

# Handle NaN values
population_labels = []
for p in matched_ancestry_labels:
    if pd.isna(p) or (isinstance(p, float) and np.isnan(p)):
        population_labels.append('Unknown')
    else:
        population_labels.append(str(p))
population_labels = np.array(population_labels)

# Filter to valid populations only
valid_pop_mask = np.isin(population_labels, ['AFR', 'EAS', 'SAS', 'EUR'])
if valid_pop_mask.sum() < len(population_labels):
    print(f"   Filtering out non-population labels...")
    print(f"   Before: {len(population_labels):,} patients")
    avg_signature_loadings = avg_signature_loadings[valid_pop_mask]
    population_labels = population_labels[valid_pop_mask]
    print(f"   After: {len(population_labels):,} patients")

# Check population counts
unique_pops = np.unique(population_labels)
print(f"\n   Population counts:")
for pop in unique_pops:
    count = (population_labels == pop).sum()
    print(f"     {pop}: {count:,}")

# Perform PCA on signature loadings
print("\n3. Performing PCA on average signature loadings...")
pca = PCA(n_components=2)
pca_result = pca.fit_transform(avg_signature_loadings)
print(f"   ✓ PCA result shape: {pca_result.shape}")
print(f"   ✓ Explained variance ratio:")
print(f"     PC1: {pca.explained_variance_ratio_[0]:.4f} ({pca.explained_variance_ratio_[0]*100:.2f}%)")
print(f"     PC2: {pca.explained_variance_ratio_[1]:.4f} ({pca.explained_variance_ratio_[1]*100:.2f}%)")
print(f"     Total: {pca.explained_variance_ratio_.sum():.4f} ({pca.explained_variance_ratio_.sum()*100:.2f}%)")

# Population colors
pop_colors = {
    'AFR': '#1f77b4',  # Blue
    'EAS': '#ff7f0e',  # Orange
    'SAS': '#2ca02c',  # Green
    'EUR': '#d62728',  # Red
}

# Create plot
print("\n4. Creating PC1 vs PC2 plot colored by ancestry...")
fig, ax = plt.subplots(figsize=(12, 10))

# Plot each population separately
for pop in unique_pops:
    pop_mask = population_labels == pop
    if pop_mask.sum() > 0:
        color = pop_colors.get(pop, '#808080')
        ax.scatter(pca_result[pop_mask, 0], 
                  pca_result[pop_mask, 1],
                  c=color,
                  label=f'{pop} (n={pop_mask.sum():,})',
                  alpha=0.6,
                  s=20,
                  edgecolors='none')

# Add reference lines
ax.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.3)
ax.axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.3)

# Labels and title
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)', 
              fontsize=13, fontweight='bold')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)', 
              fontsize=13, fontweight='bold')
ax.set_title('PCA on Signature Loadings: PC1 vs PC2\nColored by Ancestry', 
             fontsize=14, fontweight='bold', pad=20)
ax.legend(loc='best', fontsize=11, framealpha=0.9)
ax.grid(True, alpha=0.3)

plt.tight_layout()

# Save figure
fig_path = OUTPUT_DIR / 'pca_signature_loadings_by_ancestry.png'
plt.savefig(fig_path, dpi=300, bbox_inches='tight')
print(f"\n✓ Saved PCA plot to: {fig_path}")

plt.show()

# Print summary statistics by population
print("\n" + "="*80)
print("SUMMARY: Mean PC1 and PC2 by Population")
print("="*80)
for pop in ['AFR', 'EAS', 'SAS', 'EUR']:
    pop_mask = population_labels == pop
    if pop_mask.sum() > 0:
        mean_pc1 = pca_result[pop_mask, 0].mean()
        std_pc1 = pca_result[pop_mask, 0].std()
        mean_pc2 = pca_result[pop_mask, 1].mean()
        std_pc2 = pca_result[pop_mask, 1].std()
        print(f"\n{pop} (n={pop_mask.sum():,}):")
        print(f"  PC1: {mean_pc1:+.4f} ± {std_pc1:.4f}")
        print(f"  PC2: {mean_pc2:+.4f} ± {std_pc2:.4f}")

print("\n" + "="*80)
print("✓ PCA on signature loadings complete!")
print("="*80)
================================================================================
PCA ON SIGNATURE LOADINGS: PC1 vs PC2 Colored by Ancestry
================================================================================

1. Calculating average signature loadings over time...
   ✓ Average signature loadings shape: (400000, 21)

2. Getting ancestry labels...
   Filtering out non-population labels...
   Before: 400,000 patients
   After: 388,979 patients

   Population counts:
     AFR: 7,012
     EAS: 1,897
     EUR: 372,463
     SAS: 7,607

3. Performing PCA on average signature loadings...
   ✓ PCA result shape: (388979, 2)
   ✓ Explained variance ratio:
     PC1: 0.2683 (26.83%)
     PC2: 0.1902 (19.02%)
     Total: 0.4586 (45.86%)

4. Creating PC1 vs PC2 plot colored by ancestry...

✓ Saved PCA plot to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/pca_signature_loadings_by_ancestry.png
No description has been provided for this image
================================================================================
SUMMARY: Mean PC1 and PC2 by Population
================================================================================

AFR (n=7,012):
  PC1: -0.0004 ± 0.0281
  PC2: -0.0006 ± 0.0239

EAS (n=1,897):
  PC1: -0.0012 ± 0.0281
  PC2: -0.0000 ± 0.0232

SAS (n=7,607):
  PC1: -0.0003 ± 0.0283
  PC2: -0.0003 ± 0.0237

EUR (n=372,463):
  PC1: +0.0000 ± 0.0284
  PC2: +0.0000 ± 0.0239

================================================================================
✓ PCA on signature loadings complete!
================================================================================
In [12]:
# Create visualizations for each ancestry
print("\n8. Creating visualizations...")

ancestries_to_plot = ['AFR', 'EAS', 'EUR', 'SAS']

for ancestry in ancestries_to_plot:
    print(f"\n   Creating figure for {ancestry}...")
    
    # Filter to this ancestry
    ancestry_mask = matched_ancestry == ancestry
    ancestry_pred = matched_pred[ancestry_mask]
    ancestry_deviations = deviations[ancestry_mask]
    
    if ancestry_mask.sum() < 10:
        print(f"   ⚠️  Skipping {ancestry}: only {ancestry_mask.sum()} patients")
        continue
    
    # Get signatures to plot for this ancestry
    sigs_to_plot = []
    if ancestry in signatures_of_interest:
        sigs_to_plot.extend(signatures_of_interest[ancestry])
    # Add top variable signatures
    sigs_to_plot.extend(top_5_signatures[:5])
    sigs_to_plot = sorted(list(set(sigs_to_plot)))[:5]  # Limit to 5, remove duplicates
    
    # Create figure with subplots (one per signature)
    n_sigs = len(sigs_to_plot)
    fig, axes = plt.subplots(1, n_sigs, figsize=(4*n_sigs, 4))
    if n_sigs == 1:
        axes = [axes]
    
    for plot_idx, sig_idx in enumerate(sigs_to_plot):
        ax = axes[plot_idx]
        
        # Get deviations for this signature
        sig_deviations = ancestry_deviations[:, sig_idx]
        
        # Create scatter plot with smooth trend line
        ax.scatter(ancestry_pred, sig_deviations, alpha=0.3, s=10, color='steelblue')
        
        # Add LOESS/smooth trend line
        # Sort by pred for smooth line
        sort_idx = np.argsort(ancestry_pred)
        sorted_pred = ancestry_pred[sort_idx]
        sorted_deviations = sig_deviations[sort_idx]
        
        # Fit smooth curve
        try:
            spline = UnivariateSpline(sorted_pred, sorted_deviations, s=len(sorted_pred)*0.1)
            pred_smooth = np.linspace(sorted_pred.min(), sorted_pred.max(), 100)
            deviations_smooth = spline(pred_smooth)
            ax.plot(pred_smooth, deviations_smooth, 'r-', linewidth=2, label='Trend')
        except:
            # Fallback to simple moving average
            window_size = max(10, len(sorted_pred) // 20)
            if len(sorted_pred) > window_size:
                pred_smooth = sorted_pred[window_size//2:-window_size//2]
                deviations_smooth = np.convolve(sorted_deviations, np.ones(window_size)/window_size, mode='valid')
                ax.plot(pred_smooth, deviations_smooth, 'r-', linewidth=2, label='Trend')
        
        # Add horizontal line at y=0 (reference)
        ax.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
        
        # Highlight if this is a signature of interest
        title = f'Signature {sig_idx}'
        if ancestry in signatures_of_interest and sig_idx in signatures_of_interest[ancestry]:
            title += ' ⭐'
        
        ax.set_xlabel('Ancestry Probability (pred)', fontsize=10, fontweight='bold')
        ax.set_ylabel('Deviation from Reference', fontsize=10, fontweight='bold')
        ax.set_title(title, fontsize=11, fontweight='bold')
        ax.grid(alpha=0.3)
        ax.legend(fontsize=8)
    
    plt.suptitle(f'Signature Loadings vs. Ancestry Probability: {ancestry} (n={ancestry_mask.sum():,})', 
                 fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    
    # Save figure
    fig_path = OUTPUT_DIR / f'signature_loadings_by_ancestry_{ancestry}.png'
    plt.savefig(fig_path, dpi=300, bbox_inches='tight')
    print(f"   ✓ Saved to: {fig_path}")
    
    plt.show()

print("\n" + "="*80)
print("ANALYSIS COMPLETE")
print("="*80)
print(f"\n✓ Created 4 figures (one per ancestry)")
print(f"✓ Analyzed {len(matched_indices):,} patients")
print(f"✓ Signatures of interest: {signatures_of_interest}")
print(f"✓ Top 5 most variable signatures: {top_5_signatures}")
print(f"\nOutput directory: {OUTPUT_DIR}")
8. Creating visualizations...

   Creating figure for AFR...
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_AFR.png
No description has been provided for this image
   Creating figure for EAS...
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_20723/1409193948.py:77: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
  plt.tight_layout()
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_20723/1409193948.py:81: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
  plt.savefig(fig_path, dpi=300, bbox_inches='tight')
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_EAS.png
/opt/miniconda3/envs/new_env_pyro2/lib/python3.9/site-packages/IPython/core/pylabtools.py:152: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
  fig.canvas.print_figure(bytes_io, **kw)
No description has been provided for this image
   Creating figure for EUR...
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_EUR.png
No description has been provided for this image
   Creating figure for SAS...
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_SAS.png
No description has been provided for this image
================================================================================
ANALYSIS COMPLETE
================================================================================

✓ Created 4 figures (one per ancestry)
✓ Analyzed 400,000 patients
✓ Signatures of interest: {'SAS': [5], 'EAS': [15]}
✓ Top 5 most variable signatures: [15  5 19 18  6]

Output directory: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis

9. WITH vs WITHOUT PCs Comparison¶

Key Question: Does PC adjustment change biological interpretations?

Answer: No - PC adjustment controls for population stratification while preserving biological signal.

This section compares signature-disease associations (φ) and patient signature loadings (θ) between models trained WITH and WITHOUT PC adjustment.

================================================================================
PHI COMPARISON: WITH PCs vs WITHOUT PCs
================================================================================

This analysis shows that PC adjustment:
  • Controls for population stratification
  • Preserves biological signal (high correlation)
  • Does NOT drastically change disease-signature associations

📊 PHI COMPARISON RESULTS:
   Total points: 380,016
   Correlation: 1.000000
   Mean absolute difference: 0.000078
   ✓ High correlation indicates PC adjustment preserves biological signal
No description has been provided for this image

9b. Trajectory Deviations: WITH vs WITHOUT PCs¶

Compare signature trajectory deviations by ancestry for models WITH and WITHOUT PC adjustment. This shows how PC adjustment affects signature loadings while preserving overall patterns.

/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_58217/924814540.py:9: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas_nopcs = torch.load(thetas_nopcs_path, map_location='cpu')
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_58217/924814540.py:10: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas_withpcs = torch.load(thetas_withpcs_path, map_location='cpu')
No description has been provided for this image
✅ Trajectory comparison complete
   Key observation: Patterns are preserved WITH vs WITHOUT PCs
   PC adjustment controls stratification without changing biological interpretations

9c. PC-Induced Shift Heatmap: Which Signatures Shift Most?¶

This heatmap shows which signatures shift most with PC adjustment, justifying why we focus on signatures 5 and 15 for ancestry-specific analyses.

Showing PC-induced shift heatmap for SAS
  (Shift magnitude: 0.0344)
No description has been provided for this image
📊 SIGNATURES WITH LARGEST PC-INDUCED SHIFTS (SAS):
   1. Signature 5: max |Δ| = 0.0344 at time 35 (Δ = +0.0344)
   2. Signature 15: max |Δ| = 0.0118 at time 35 (Δ = +0.0118)
   3. Signature 17: max |Δ| = 0.0089 at time 34 (Δ = -0.0089)
   4. Signature 3: max |Δ| = 0.0084 at time 0 (Δ = -0.0084)
   5. Signature 7: max |Δ| = 0.0083 at time 4 (Δ = +0.0083)

✅ Key finding: Signatures 5 and 15 show largest PC-induced shifts
   This justifies focusing on these signatures for ancestry-specific analyses
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_58217/2909997640.py:26: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  ancestry = pd.read_csv(ancestry_path, sep='\t', usecols=['eid', 'rf80']).drop_duplicates('eid')
No description has been provided for this image
The Kernel crashed while executing code in the current cell or a previous cell. 

Please review the code in the cell(s) to identify a possible cause of the failure. 

Click <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. 

View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details.

Summary and Response¶

Key Findings¶

  1. Ancestry effects are continuous, not binary

    • Signature 5 (SAS): Mean deviation increases from ~0.015 to ~0.036 as ancestry probability increases from 0.3 to 0.95
    • Signature 15 (EAS): Similar continuous increase with ancestry probability
    • Variance patterns: Explained by sample size (larger samples at high pred show full distribution)
  2. PC adjustment preserves biological signal

    • Phi (signature-disease associations): Correlation >0.99 between WITH and WITHOUT PCs
    • Mean difference <0.002 indicates minimal impact on disease-signature associations
    • Trajectory patterns preserved: Signature deviations by ancestry are similar WITH vs WITHOUT PCs
  3. PC-induced shifts identify ancestry-relevant signatures

    • Signatures 5 and 15 show largest PC-induced shifts (heatmap analysis)
    • This justifies focusing on signatures 5 (SAS) and 15 (EAS) for ancestry-specific analyses
    • Other signatures show minimal shifts, confirming PC adjustment targets ancestry-specific variation

Implications¶

  1. Model captures continuous genetic variation, not discrete ancestry groups
  2. Ancestry probability can be used to refine predictions
  3. Effects are polygenic (continuous) rather than monogenic (binary)
  4. PC adjustment controls stratification without changing biological interpretations

Response to Reviewer¶

"We address population stratification through two complementary analyses:

  1. Continuous ancestry effects: We analyze signature loadings as a continuous function of ancestry probability (pred), avoiding harsh thresholding. We find that ancestry effects are continuous (e.g., Signature 5 increases continuously with SAS ancestry probability), demonstrating that signatures capture polygenic architecture rather than discrete ancestry groups.

  2. PC adjustment validation: We compare models WITH and WITHOUT PC adjustment and find that PC adjustment preserves biological signal (phi correlation >0.99, mean difference <0.002) while controlling for population stratification. Signature trajectory patterns by ancestry are preserved, confirming that PC adjustment improves model calibration without changing biological interpretations.

  3. Signature selection rationale: PC-induced shift analysis reveals that signatures 5 and 15 show the largest shifts with PC adjustment, justifying our focus on these signatures for ancestry-specific analyses (Signature 5 for SAS ancestry, Signature 15 for EAS ancestry)."